FidaPLUS corpus of Slovenian
Abstract
The paper describes the FidaPLUS corpus, an upgrade of the Slovenian reference corpus. The corpus has been improved on several levels: size, up-to-dateness, quality of linguistic annotation (lemmatization, POS tagging), availability, and user-friendliness of the online concordancer. It has also been implemented in the Sketch Engine software, which produces one-page, automatic, corpus-based summaries of a word's grammatical and collocational behaviour. We describe the history of the project and present the characteristics of the corpus and its tools.
Similar resources
The JOS Morphosyntactically Tagged Corpus of Slovene
The JOS morphosyntactic resources for Slovene consist of the specifications, lexicon, and two corpora: jos100k, a 100,000 word balanced monolingual sampled corpus annotated with hand validated morphosyntactic descriptions (MSDs) and lemmas, and jos1M, the 1 million word partially hand validated corpus. The two corpora have been sampled from the 600M word Slovene reference corpus FidaPLUS. The J...
Slovene Word Sketches
Word sketches are one-page automatic, corpus-based summaries of a word's grammatical and collocational behaviour. They were first used in the production of the Macmillan English Dictionary (Rundell 2002). At that point, they only existed for English. Today, the Sketch Engine is available, a corpus tool which takes as input a corpus of any language and corresponding grammar patterns and which ge...
Contents and evaluation of the first Slovenian-German online dictionary
This paper presents the first Slovenian-German and German-Slovenian online dictionary and contains evaluation figures for its Slovenian part. Evaluations are based on coverage of a Slovenian newspaper corpus as well as on user queries.
BNSI Slovenian broadcast news database - speech and text corpus
This paper presents the BNSI Slovenian Broadcast News database project. The result of the project is a database with a speech and text corpus oriented toward large-vocabulary continuous speech recognition in the general domain. The speech corpus consists of 36 hours of transcribed evening and late-night news. The raw database material was captured in the archive of national broadcaster RTV Slovenia t...
Slovene Terminology Web Portal and the TBX-Compatible Simplified DTD/schema
The paper describes the project whose main purpose is the creation of the Slovene terminology web portal, funded by the Slovene Research Agency and the Amebis software company. It focuses on the DTD/schema used for the unification of different terminology resources in different input formats into one database available on the web. Two projects involving unification DTD/schemas were taken as the...